Microfilm, Paper, and OCR: Issues in Newspaper Digitization
Authors
Abstract
By Kenning Arlitsch and John Herbert. Kenning Arlitsch and John Herbert are both at the J. Willard Marriott Library, University of Utah. Mr. Arlitsch (kenning.arlitsch@library.utah.edu; 295 S. 1500 East, Room 463, Salt Lake City, UT 84112) is Head of Information Technology, and Mr. Herbert (John. [email protected]; 295 S. 1500 East, Room 418, Salt Lake City, UT 84112) is Program Director, Utah Digital Newspapers. They gratefully acknowledge the contributions of Scott Christensen and Frederick Zarndt of iArchives Inc., and of Randy Silverman, Preservation Librarian at the Marriott Library, in the preparation of this manuscript.
Similar resources
Full-Text Access to Historical Newspapers Tapas Kanungo and
Newspapers are rich records of U.S. history. Due to the deterioration of older newspapers, the National Endowment for the Humanities is archiving 19th century newspapers on microfilm. Although microfilm is a good preservation method, it provides limited access to researchers and the general public. We are building a system to provide universal access to digital images and full-text content of h...
Automatic Indexing of Newspaper Microfilm Images
This paper describes a proposed document analysis system that aims at automatic indexing of digitized images of old newspaper microfilms. This is done by extracting news headlines from microfilm images. The headlines are then converted to machine readable text by OCR to serve as indices to the respective news articles. A major challenge to us is the poor image quality of the microfilm as most i...
Google Newspaper Search - Image Processing and Analysis Pipeline
The Google Newspaper Search program was launched on September 8, 2008[1]. In this paper, we outline the technology pieces underlying this large and complex project. We have created a production pipeline which takes newspaper microfilms as input and emits individual news articles as output. These articles are then indexed and added to the content base, so that they turn up in response to Google ...
Non-interactive OCR Post-correction for Giga-Scale Digitization Projects
This paper proposes a non-interactive system for reducing the level of OCR-induced typographical variation in large text collections, contemporary and historical. Text-Induced Corpus Clean-up or ticcl (pronounce ’tickle’) focuses on high-frequency words derived from the corpus to be cleaned and gathers all typographical variants for any particular focus word that lie within the predefined Leven...
Page 10
Online content-searchable databases of music scores, unlike text databases, are extremely rare. The main reasons are the cost of digitization, the inaccessibility of original music scores and manuscripts, and the lack of sophisticated music recognition software. The proposed research seeks to circumvent these difficulties by investigating the feasibility of using existing microfilms for digitiz...